Unbiased Learning to Rank: Counterfactual and Online Approaches
This tutorial covers and contrasts the two main methodologies in unbiased
Learning to Rank (LTR): Counterfactual LTR and Online LTR. There has long been
interest in LTR from user interactions; however, this form of implicit
feedback is heavily biased. In recent years, unbiased LTR methods have been
introduced to remove the effects of different types of bias caused by
user behavior in search. For instance, a well-addressed type of bias is
position bias: the rank at which a document is displayed heavily affects the
interactions it receives. Counterfactual LTR methods deal with such types of
bias by learning from historical interactions while correcting for the effect
of the explicitly modelled biases. Online LTR, in contrast, uses no explicit
user model; it learns through an interactive process in which randomized
results are displayed to the user. Through randomization, the effects of
different types of bias can be removed from the learning process. Though both
methodologies lead to unbiased LTR, their approaches differ considerably, as
do their theoretical guarantees, empirical results, effects on the user
experience during learning, and applicability. Consequently, for
practitioners the choice between the two is very consequential. By providing
an overview of both approaches and contrasting them, we aim to provide an
essential guide to unbiased LTR that aids in understanding and choosing
between methodologies.
Comment: Abstract for tutorial appearing at SIGIR 201
Safe Exploration for Optimizing Contextual Bandits
Contextual bandit problems are a natural fit for many information retrieval
tasks, such as learning to rank, text classification, recommendation, etc.
However, existing learning methods for contextual bandit problems have one of
two drawbacks: they either do not explore the space of all possible document
rankings (i.e., actions) and, thus, may miss the optimal ranking, or they
present suboptimal rankings to a user and, thus, may harm the user experience.
We introduce a new learning method for contextual bandit problems, Safe
Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by
using a baseline (or production) ranking system (i.e., policy), which does not
harm the user experience and, thus, is safe to execute, but has suboptimal
performance and, thus, needs to be improved. Then SEA uses counterfactual
learning to learn a new policy based on the behavior of the baseline policy.
SEA also uses high-confidence off-policy evaluation to estimate the performance
of the newly learned policy. Once the performance of the newly learned policy
is at least as good as the performance of the baseline policy, SEA starts using
the new policy to execute new actions, allowing it to actively explore
favorable regions of the action space. This way, SEA never performs worse than
the baseline policy and, thus, does not harm the user experience, while still
exploring the action space and, thus, being able to find an optimal policy. Our
experiments using text classification and document retrieval confirm the above
by comparing SEA (and a boundless variant called BSEA) to online and offline
learning methods for contextual bandit problems.
Comment: 23 pages, 3 figures
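SEA's safety condition — switch to the new policy only once high-confidence off-policy evaluation says it is at least as good as the baseline — can be sketched as follows. This is a toy illustration assuming a plain IPS estimator with a normal-approximation lower confidence bound; the function names and log format are ours, not the paper's implementation:

```python
import math

def ips_value_with_lcb(logs, new_policy_prob, z=1.96):
    """Off-policy IPS estimate of a new policy's value, with a
    normal-approximation lower confidence bound.
    logs: iterable of (context, action, reward, baseline_prob)."""
    ratios = [r * new_policy_prob(c, a) / p for (c, a, r, p) in logs]
    n = len(ratios)
    mean = sum(ratios) / n
    var = sum((x - mean) ** 2 for x in ratios) / max(n - 1, 1)
    return mean, mean - z * math.sqrt(var / n)

def choose_policy(baseline_value, logs, new_policy_prob):
    # Deploy the learned policy only when its high-confidence lower bound
    # matches or beats the baseline's value -- the "safe" switch in spirit.
    _, lcb = ips_value_with_lcb(logs, new_policy_prob)
    return "new" if lcb >= baseline_value else "baseline"
```

Until the bound clears the baseline, the baseline keeps serving users, which is what makes exploration safe.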
To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions
Learning to Rank (LTR) from user interactions is challenging as user feedback
often contains high levels of bias and noise. At the moment, two methodologies
for dealing with bias prevail in the field of LTR: counterfactual methods that
learn from historical data and model user behavior to deal with biases; and
online methods that perform interventions to deal with bias but use no explicit
user models. For practitioners, the choice between the two methodologies is very
important because of its direct impact on end users. Nevertheless, there has
never been a direct comparison between these two approaches to unbiased LTR. In
this study we provide the first benchmarking of both counterfactual and online
LTR methods under different experimental conditions. Our results show that the
choice between the methodologies is consequential and depends on the presence
of selection bias, and the degree of position bias and interaction noise. In
settings with little bias or noise counterfactual methods can obtain the
highest ranking performance; however, in other circumstances their optimization
can be detrimental to the user experience. Conversely, online methods are very
robust to bias and noise but require control over the displayed rankings. Our
findings confirm and contradict existing expectations on the impact of
model-based and intervention-based methods in LTR, and allow practitioners to
make an informed decision between the two methodologies.
Comment: SIGIR 201
Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers
Query expansion has proven effective at improving the recall and
precision of first-stage retrievers, yet its influence on a complex,
state-of-the-art cross-encoder ranker remains under-explored. We first show
that directly applying the expansion techniques in the current literature to
state-of-the-art neural rankers can result in deteriorated zero-shot
performance. To address this, we propose GFF, a pipeline that includes a large
language model and a neural ranker, to Generate, Filter, and Fuse query
expansions more effectively in order to improve the zero-shot ranking metrics
such as nDCG@10. Specifically, GFF first calls an instruction-following
language model to generate query-related keywords through a reasoning chain.
Leveraging self-consistency and reciprocal rank weighting, GFF further filters
and combines the ranking results of each expanded query dynamically. By
utilizing this pipeline, we show that GFF can improve the zero-shot nDCG@10 on
BEIR and TREC DL 2019/2020. We also analyze different modelling choices in the
GFF pipeline and shed light on the future directions in query expansion for
zero-shot neural rankers.
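The "Fuse" step's reciprocal rank weighting can be illustrated with plain reciprocal rank fusion over the ranked lists produced for each expanded query. A sketch under that assumption; the constant k=60 is the conventional RRF default, not necessarily the paper's choice:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: each document's fused score
    is the sum of 1 / (k + rank) over the lists that contain it."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" sits near the top of both expansions' rankings, so fusion
# promotes it above documents that only one expansion favored.
fused = reciprocal_rank_fusion([["d1", "d2", "d3"],
                                ["d2", "d3", "d1"]])
```

Filtering (e.g. via self-consistency across multiple LLM samples) would happen before fusion, dropping expansions whose keywords the model does not reproduce consistently.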
Regression Compatible Listwise Objectives for Calibrated Ranking
As Learning-to-Rank (LTR) approaches primarily seek to improve ranking
quality, their output scores are not scale-calibrated by design -- for example,
adding a constant to the score of each item on the list will not affect the
list ordering. This fundamentally limits LTR usage in score-sensitive
applications. Though a simple multi-objective approach that combines a
regression and a ranking objective can effectively learn scale-calibrated
scores, we argue that the two objectives can be inherently conflicting, which
makes the trade-off far from ideal for both of them. In this paper, we propose
a novel regression compatible ranking (RCR) approach to achieve a better
trade-off. The advantage of the proposed approach is that the regression and
ranking components are well aligned which brings new opportunities for
harmonious regression and ranking. Theoretically, we show that the two
components share the same minimizer at global minima while the regression
component ensures scale calibration. Empirically, we show that the proposed
approach performs well on both regression and ranking metrics on several public
LTR datasets, and significantly improves the Pareto frontiers in the context of
multi-objective optimization. Furthermore, we evaluated the proposed approach
on YouTube Search and found that it not only improved the ranking quality of
the production pCTR model, but also brought gains to the click prediction
accuracy.
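The conflict the paper starts from is visible in the naive multi-objective combination it improves upon: a pointwise regression loss (which pins scores to a calibrated scale) plus a listwise softmax loss (which is invariant to shifting every score by a constant). The sketch below shows that combination, not the RCR objective itself; the loss choices are illustrative:

```python
import math

def multi_objective_loss(scores, labels, alpha=0.5):
    """Naive combination of a pointwise sigmoid cross-entropy (regression,
    scale-sensitive) and a listwise softmax cross-entropy (ranking,
    shift-invariant), traded off by alpha."""
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    # Pointwise term: binary cross-entropy against the labels.
    reg = -sum(y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s))
               for s, y in zip(scores, labels)) / len(scores)
    # Listwise term: softmax cross-entropy with label-normalized targets.
    exps = [math.exp(s) for s in scores]
    z, total = sum(exps), sum(labels)
    rank = -sum((y / total) * math.log(e / z)
                for y, e in zip(labels, exps) if y > 0)
    return alpha * reg + (1 - alpha) * rank
```

Adding a constant to every score leaves the listwise term untouched but moves the regression term, which is exactly why the two objectives pull in different directions.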
RD-Suite: A Benchmark for Ranking Distillation
The distillation of ranking models has become an important topic in both
academia and industry. In recent years, several advanced methods have been
proposed to tackle this problem, often leveraging ranking information from
teacher rankers that is absent in traditional classification settings. To date,
there is no well-established consensus on how to evaluate this class of models.
Moreover, inconsistent benchmarking across a wide range of tasks and datasets
makes it difficult to assess or invigorate advances in this field. This paper
first examines representative prior art on ranking distillation and raises three
questions to be answered around methodology and reproducibility. To that end,
we propose a systematic and unified benchmark, Ranking Distillation Suite
(RD-Suite), which is a suite of tasks with 4 large real-world datasets,
encompassing two major modalities (textual and numeric) and two applications
(standard distillation and distillation transfer). RD-Suite consists of
benchmark results that challenge some of the common wisdom in the field, and
the release of datasets with teacher scores and evaluation scripts for future
research. RD-Suite paves the way towards a better understanding of ranking
distillation, facilitates more research in this direction, and presents new
challenges.
Comment: 15 pages, 2 figures. arXiv admin note: text overlap with
arXiv:2011.04006 by other authors
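The "ranking information from teacher rankers" that such methods exploit is often transferred with a listwise objective, e.g. the KL divergence between the teacher's and student's softmax distributions over a list. A minimal sketch (a generic distillation loss, not RD-Suite's specific tasks):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_distill_loss(student_scores, teacher_scores):
    """KL(teacher || student) over softmax distributions on one list:
    zero when the student reproduces the teacher's score pattern,
    positive otherwise. Uses the teacher's full ordering information,
    unlike hard-label classification distillation."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because softmax is shift-invariant, the student only has to match the teacher's relative score differences, not its absolute scale.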
Query Expansion by Prompting Large Language Models
Query expansion is a widely used technique to improve the recall of search
systems. In this paper, we propose an approach to query expansion that
leverages the generative abilities of Large Language Models (LLMs). Unlike
traditional query expansion approaches such as Pseudo-Relevance Feedback (PRF)
that rely on retrieving a good set of pseudo-relevant documents to expand
queries, we rely on the generative and creative abilities of an LLM and
leverage the knowledge inherent in the model. We study a variety of different
prompts, including zero-shot, few-shot and Chain-of-Thought (CoT). We find that
CoT prompts are especially useful for query expansion as these prompts instruct
the model to break queries down step-by-step and can provide a large number of
terms related to the original query. Experimental results on MS-MARCO and BEIR
demonstrate that query expansions generated by LLMs can be more powerful than
traditional query expansion methods.
Comment: 7 pages, 2 figures
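The overall flow — prompt an LLM for query-related terms, then append them to the (re-weighted) original query — can be sketched as below. The prompt wording and the repeat-the-query weighting scheme are illustrative assumptions, not quoted from the paper:

```python
def build_cot_prompt(query):
    """Illustrative Chain-of-Thought-style expansion prompt: asking for a
    rationale tends to surface many terms related to the query."""
    return (f"Answer the following query:\n{query}\n"
            "Give the rationale before answering.")

def expand_query(query, llm_terms, repeat=5):
    # Repeat the original query before appending LLM-generated terms, so
    # expansion terms do not drown out the original intent in a
    # bag-of-words retriever.
    return " ".join([query] * repeat + list(llm_terms))

# llm_terms would come from sending build_cot_prompt(query) to an LLM
# and extracting keywords from its response (not shown here).
expanded = expand_query("best gpu", ["graphics", "cuda"], repeat=2)
```

The resulting string is then issued to the retriever in place of the original query.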